130 ◾ Bioinformatics
1. Acquiring the raw data in FASTQ files.
2. Downloading the sequence of the reference genome, indexing (with samtools and
bwa) the reference sequence, and creating a dictionary (with Picard) for the reference
sequence.
3. Mapping the raw reads in FASTQ files to a reference genome using any of the aligner
(e.g., BWA) to generate aligned reads in SAM files.
4. Converting SAM files to BAM files using samtools.
5. Sorting and indexing the alignment in the BAM files using samtools.
6. Extracting the chromosome or the interval of interest, otherwise, skipping this step
if you are interested in the variants of the whole genome.
7. Removing or marking duplicate alignments in the BAM files using Picard and index-
ing the resulted BAM with samtools.
8. Adding RG header to the BAM files using Picard and indexing the resulted BAM files
with samtools.
9. Applying Base Quality Score Recalibration (BQSR) for detecting systematic errors
made by the sequencer.
10. Identification of the genotype at all positions per sample. This step involves
“HaplotypeCaller”, which takes a BAM file as input and outputs a GVCF file format
for each sample.
11. The next step in the GATK workflow is to consolidate the GVCF files of the multiple
samples and the variants are called across all samples in the cohort forming a single
VCF file. This step involves GenomicsDBImport and GenotypeGVCFs.
12. Separation of SNPs and InDels in separate VCF files using GATK SelectVariants.
4.2.2.2.1 Acquiring raw data
The raw sequence data usually is in FASTQ format. This time, we will use human raw data
from 1000 Genome project to demonstrate variant calling using GATK4 best practice. For
simplicity, instead of the whole genome, we will extract only the reads of chromosome
21 of 13 individuals including males and females from Africa, Asia, and Europe. Table
4.2 shows the NCBI SRA data from which the FASTQ files for chromosome 21 will be
extracted and sample information.
To keep the files organized, we can create a directory for this project “vcfgatk”, and
inside the directory, we will create a text file “ids.txt” that contains only the above ID col-
umn without a column header and each ID in a line. Then, while you are inside the project
directory, you can save the following bash script in a file “download.sh” and execute it as
“bash download.sh”: